Abstract



Bayesian Optimization (BO) aims at optimizing an unknown function that is costly to evaluate. We focus on 
applications where concurrent function evaluations are possible. In such cases, BO could choose to either sequentially 
evaluate the function (sequential mode) or evaluate the function at a batch of multiple inputs at once (batch mode). 
The sequential mode generally leads to better optimization performance as each function evaluation is selected with 
more information, whereas the batch mode is more time efficient (smaller number of iterations). Our goal is to 
combine the strength of both settings. We systematically analyze BO using a Gaussian Process as the posterior 
estimator and provide a hybrid algorithm that dynamically switches between sequential and batch with variable batch 
sizes. We theoretically justify our algorithm and present experimental results on eight benchmark BO problems. The 
results show that our method achieves substantial speedup (up to 78%) compared to sequential, without suffering any 
significant performance loss. 



1 Introduction 

Bayesian optimization tries to optimize an unknown function $f(\cdot)$ by requesting a set of experiments when $f(\cdot)$ is costly to evaluate [8, 4]. In this work, we are interested in finding a point $x^* \in \mathcal{X}^d$ such that:

$$x^* = \arg\max_{x \in \mathcal{X}^d} f(x), \qquad (1)$$

where $\mathcal{X}^d$ is our d-dimensional compact input space and $f(\cdot)$ is the non-concave underlying function which has multiple local optima. The function $f(\cdot)$ might be the performance of a black box device characterized by input $x$. For example, in our motivating application we try to optimize the power output of nano-enhanced Microbial Fuel Cells (MFCs). MFCs [3] use micro-organisms to generate electricity. It has been shown that the efficiency of the generated electric power depends significantly on the surface properties of the anode [12]. Our problem involves optimizing the surface properties of the anodes in order to maximize the output power. The goal is to develop an efficient BO algorithm for this application, since running an experiment is very expensive and time consuming.

Focusing on the task of function maximization, each run of BO consists of two main steps: estimating the values of the unknown function $f(\cdot)$ via a probabilistic model such as GP, and selecting the best next experiment(s) according to the probabilistic model via some selection criterion. The results of the experiment(s) are then added to update the probabilistic model, and this cycle is repeated until we meet a stopping criterion.

Most of the proposed selection criteria in BO are sequential, where only one experiment is selected at each iteration [11, 8, 14, 9]. Sequential policies usually perform very well in practice, since they optimize the experiment selection at each iteration by using the maximum available information for each experiment. However, they are not time efficient in many applications where running an experiment takes a long time and we have the capability to run multiple experiments in parallel. This motivates the batch algorithms, in which more than one experiment is selected at each iteration.

Recently, Azimi et al. [2] introduced a batch BO approach that selects a batch of k experiments at each iteration that approximates the behavior of a given sequential heuristic. Ginsbourger et al. [7] introduced a constant liar heuristic algorithm to select a batch of experiments based on the Expected Improvement (EI) [9] policy. Specifically, after selecting an experiment by EI, the output of the selected point is set to a constant value. This experiment is then added to the prior and the procedure is repeated until k experiments are selected. Although these two batch algorithms [2, 7] can speed up the experiment selection by a factor of k, their results show that batch selection in general performs worse than the sequential EI policy, especially when the total number of experiments is small. This observation motivates us to introduce a Hybrid BO approach that dynamically alternates between sequential and batch selection to achieve improved time efficiency over sequential without degrading the optimization performance.

In this paper, we focus on a class of batch policies that is based on simulating a sequential policy and provide a systematic approach to analyze such batch BO policies. We analytically connect the mismatch between the BO's probabilistic model and the underlying true function to the performance of the batch policy. We provide a full characterization of simulation-based batch policies when the batch size is 2. For the purpose of illustration, consider a batch policy that selects 2 experiments. The first experiment matches the sequential policy. The choice of the second experiment, however, depends on the simulated outcome of the first experiment. We show that the distance between the second experiment picked by a simulation-based batch policy (without the knowledge of the output of the first experiment) and the one picked by the sequential policy (with the knowledge of the output of the first experiment) is upper-bounded by a quantity that is proportional to the square root of the estimation error (of the outcome of the first experiment).

This analysis naturally gives rise to our hybrid batch/sequential algorithm. Our algorithm works as follows: At 
each step, given any sequential policy (EI in this paper), find the best next single experiment and estimate its possible 
outcome via BO's probabilistic model (GP in this paper). Then, update the prior with that point and choose the next 
best single experiment and so on. We analytically show that this process can be continued until a certain stopping 
criterion is met. This stopping criterion measures how much a simulated experiment is going to bias our probabilistic 
model (mainly because of inaccuracy in estimation of the outcomes of the first experiment). If the bias is small, we 
continue to add more examples to our batch; and if it is large, we stop. 

The proposed algorithm has the appealing property that it behaves more like a sequential policy in early stages 
when the number of observed experiments is small, and naturally transits to batch mode in later stages when more 
experiments are available. This is because the stopping criterion tends to be more stringent in early stages because 
the bias of the prior can be potentially large, forcing the algorithm to act sequentially. The beauty of this algorithm is 
that it evolves from a sequential algorithm to a batch algorithm in an optimal manner characterized by our theoretical 
results. 

Experimental results show that the proposed algorithm can achieve up to 78% speedup over the sequential policy 
without degrading the performance even with a very small number of experiments. We also show that, by increasing 
the number of experiments, the speedup rate is increased significantly which is consistent with the theoretical results 
presented in the paper. 

The paper is organized as follows. We introduce the Gaussian Process, which is used as our model, in Section 2. The proposed dynamic batch algorithm is described in Section 3. Section 4 presents the experimental results and the paper is concluded in Section 5.

2 Gaussian Process 

A BO algorithm has two main ingredients: a probabilistic model for the unknown function, and a selection criterion for choosing the next best experiment(s) based on the model. We select GP [13] as our probabilistic model and EI [9] as our selection criterion. We study the properties of GP in this section and postpone the analysis of EI to the next section.

We use GP to build the posterior over the outcome values given our observation set $O = (x_O, y_O)$, where $x_O = \{x_1, x_2, \ldots, x_n\}$ is the set of inputs and $y_O = \{y_1, y_2, \ldots, y_n\}$ is the set of outcomes (of the experiment) such that $y_j = f(x_j)$ and $f(\cdot)$ is the underlying unknown function.

For a new input point $x_i$, GP models the unknown output $y_i = f(x_i)$ as a normal random variable $y_i \sim \mathcal{N}(\mu_{x_i|O}, \sigma^2_{x_i|O})$, with $\mu_{x_i|O} = k(x_i, x_O)\,k(x_O, x_O)^{-1} y_O$ and $\sigma^2_{x_i|O} = k(x_i, x_i) - k(x_i, x_O)\,k(x_O, x_O)^{-1} k(x_O, x_i)$, where $k(\cdot,\cdot)$ is any arbitrary kernel function.

Definition 1. Let $x^* = \{x^*_1, x^*_2, \ldots, x^*_m\} \subseteq \mathcal{X} \setminus x_O$ be any unobserved set of points. Let $\hat{y} = \{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_m\}$ be our estimate of their outputs based on GP, considering $y^*_i | O \sim \mathcal{N}(\mu_{x^*_i|O}, \sigma^2_{x^*_i|O})$. For any point $z \in \mathcal{X} \setminus \{x_O \cup x^*\}$, let $y_z | O \sim \mathcal{N}(\mu_{z|O}, \sigma^2_{z|O})$ and $y_z | O, (x^*, \hat{y}) \sim \mathcal{N}(\hat{\mu}_{z|O,x^*}, \sigma^2_{z|O,x^*})$.

Under the GP model, the variance of a point $z$ depends only on the location of the observed points and is independent of their outputs, i.e., $\sigma^2_{z|O,(x^*,\hat{y})} = \sigma^2_{z|O,x^*}$. Therefore, we can update the variance of any point $z$ after finalizing our new query set $x^*$ without the knowledge of their true outputs $y^* = f(x^*)$. The following theorem characterizes the change in the variance of $z$ if we query $x^*$.

Theorem 1. Assuming $\Delta(\sigma_z) := \sigma^2_{z|O} - \sigma^2_{z|O,x^*}$, we have

$$\Delta(\sigma_z) = \left(CA^{-1}B^T - k(z,x^*)\right) D \left(CA^{-1}B^T - k(z,x^*)\right)^T, \qquad (2)$$

where $B = k(x^*, x_O)$, $A = k(x_O, x_O)$, $C = k(z, x_O)$ and $D = \left(k(x^*, x^*) - BA^{-1}B^T\right)^{-1}$.

From a practical point of view, this theorem enables us to update the variance of $z$ by computing the difference $\Delta(\sigma_z)$ and subtracting it from the previous value $\sigma^2_{z|O}$. This scheme is much faster than recalculating the variance of $z$ directly. The computational bottleneck of this update is only the matrix inversion in $D$ with complexity $O(m^3)$, considering the fact that $k(x_O, x_O)^{-1}$ has been computed before, while the complexity of the direct variance computation is $O((n+m)^3)$.
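As a rough numerical illustration of Theorem 1, the sketch below compares the update of equation (2), which reuses a cached $A^{-1}$, against recomputing both posterior variances directly. The helper names (`gauss_kernel`, `variance_drop`, `posterior_var`), the kernel width, and the random test points are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def gauss_kernel(X, Y, width=0.1):
    """k(x, y) = exp(-||x - y||^2 / width) for all pairs of rows (width is an assumed value)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / width)

def variance_drop(z, x_obs, x_new, A_inv):
    """Delta(sigma_z) from equation (2), reusing the cached inverse A^{-1}."""
    B = gauss_kernel(x_new, x_obs)                 # m x n
    C = gauss_kernel(z, x_obs)                     # 1 x n
    D = np.linalg.inv(gauss_kernel(x_new, x_new) - B @ A_inv @ B.T)  # m x m inversion only
    r = C @ A_inv @ B.T - gauss_kernel(z, x_new)   # 1 x m
    return (r @ D @ r.T).item()

def posterior_var(z, X):
    """Direct GP posterior variance of z given inputs X (no reuse of cached inverses)."""
    kz = gauss_kernel(z, X)
    return (gauss_kernel(z, z) - kz @ np.linalg.solve(gauss_kernel(X, X), kz.T)).item()

rng = np.random.default_rng(0)
x_obs, x_new, z = rng.random((6, 2)), rng.random((3, 2)), rng.random((1, 2))
A_inv = np.linalg.inv(gauss_kernel(x_obs, x_obs))

delta = variance_drop(z, x_obs, x_new, A_inv)
direct = posterior_var(z, x_obs) - posterior_var(z, np.vstack([x_obs, x_new]))
print(delta, direct)  # the two values should agree up to floating-point error
```

The only new inversion in `variance_drop` is the $m \times m$ matrix behind $D$, which is where the $O(m^3)$ cost quoted above comes from.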

The actual expected value $\mu_{z|O,x^*}$ heavily depends on the true outputs $y^* = f(x^*)$, which are not available. Without the knowledge of the true outputs, we make an estimation $\hat{\mu}_{z|O,x^*}$ based on the GP-suggested output values $\hat{y}$. We bound this estimation error in the next theorem.

Theorem 2. Let $\gamma_z = \left\| \left(k(z,x^*) - CA^{-1}B^T\right) D \right\|_2$. Then,

$$\left|\mu_{z|O,x^*} - \hat{\mu}_{z|O,x^*}\right| \le \gamma_z \, \|y^* - \hat{y}\|_2,$$
$$\left|\mu_{z|O,x^*} - \mu_{z|O}\right| \le \gamma_z \, \|y^* - \mu_{x^*|O}\|_2.$$

Here, $\|\cdot\|_2$ is the vector 2-norm. This theorem tells us that our estimation error at point $z$ is proportional to the parameter $\gamma_z$, which is known to us without the knowledge of $y^*$. Intuitively, if $\gamma_z$ is small, we would think that our estimation $\hat{\mu}_{z|O,x^*}$ is accurate and hence we can make our decision about the point $z$ without knowing $y^*$, i.e., before the result of the experiment on $x^*$ returns. This observation tells us that it is possible to do batch BO without a big loss in performance.
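The first bound of Theorem 2 can be checked numerically with a small sketch like the one below. The objective `f`, the kernel width, and the point sets are hypothetical stand-ins chosen only for illustration; $\hat{y}$ is set to the GP mean $\mu_{x^*|O}$.

```python
import numpy as np

def k(X, Y, width=0.1):
    """Gaussian kernel exp(-||x - y||^2 / width); the width is an assumed value."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / width)

def post_mean(z, X, y):
    """Zero-mean GP posterior mean at z given inputs X and outputs y."""
    return (k(z, X) @ np.linalg.solve(k(X, X), y)).item()

rng = np.random.default_rng(1)
x_obs, x_new, z = rng.random((6, 2)), rng.random((2, 2)), rng.random((1, 2))
f = lambda X: np.sin(3 * X[:, 0]) + np.cos(2 * X[:, 1])  # hypothetical objective
y_obs, y_true = f(x_obs), f(x_new)

A_inv = np.linalg.inv(k(x_obs, x_obs))
B, C = k(x_new, x_obs), k(z, x_obs)
D = np.linalg.inv(k(x_new, x_new) - B @ A_inv @ B.T)
gamma_z = np.linalg.norm((k(z, x_new) - C @ A_inv @ B.T) @ D)  # needs no outputs at x_new

y_hat = B @ A_inv @ y_obs                  # estimate y_hat = mu_{x*|O}
X_all = np.vstack([x_obs, x_new])
mu_true = post_mean(z, X_all, np.concatenate([y_obs, y_true]))
mu_hat = post_mean(z, X_all, np.concatenate([y_obs, y_hat]))

lhs = abs(mu_true - mu_hat)
rhs = gamma_z * np.linalg.norm(y_true - y_hat)
print(lhs, rhs)  # lhs should not exceed rhs
```

With this choice of $\hat{y}$, `mu_hat` also coincides with the posterior mean computed from the original observations alone, since conditioning on the GP's own mean prediction leaves the mean unchanged.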

Remark: If we want to minimize our estimation error of $\hat{\mu}_{z|O,x^*}$ in expectation, we should set $\hat{y} = \mu_{x^*|O}$. This is in some sense trivial and even counterintuitive. One might claim that if the unknown function is upper-bounded by $M$, then the best choice for $\hat{y}$ is $M$, since it increases the expected value around the optimal point in the GP model. However, this theorem shows that this choice is overly optimistic.

The previous theorem provides a performance bound based on our estimation error on $y^*$; however, from a practical point of view, that bound cannot be computed since we do not know the exact values of $y^*$. As a practical measure, we would like to focus on the expected value of the estimation error as opposed to the error itself. The next corollary provides an upper bound on the expected error, obtained by simply taking the expectation of the result of Theorem 2.

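Corollary 1. Let $\theta_{x^*} := \sqrt{2/\pi}\, \|\sigma_{x^*|O}\|_1$. Then,

$$E_{y^*}\left[\left|\mu_{z|O,x^*} - \mu_{z|O}\right|\right] \le \gamma_z \theta_{x^*}.$$

Moreover,

$$E_{y^*}\left[\left|\mu_{z|O,x^*} - \hat{\mu}_{z|O,x^*}\right|\right] \le \gamma_z \left(\theta_{x^*} + \|\hat{y} - \mu_{x^*|O}\|_2\right).$$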

Remark 1: We focus on the second bound in this corollary, which has two terms. The first term ($\gamma_z \theta_{x^*}$) measures "how close" the point $z$ is to $x^*$. The second term captures the bias of our estimator $\hat{y}$. According to this corollary, the best choice for $\hat{y}$ is the mean $\mu_{x^*|O}$.

Remark 2: This corollary entails that if for some small value of $\epsilon$, we have

$$\gamma_z \left(\theta_{x^*} + \|\hat{y} - \mu_{x^*|O}\|_2\right) \le \epsilon, \qquad (3)$$

then we are guaranteed that

$$E_{y^*}\left[\left|\mu_{z|O,x^*} - \hat{\mu}_{z|O,x^*}\right|\right] \le \epsilon.$$

Since $\gamma_z$ and $\theta_{x^*}$ are both computable without the knowledge of $y^*$, this observation motivates us to use this as a stopping criterion for our algorithm to determine if the current estimation bias is too large to continue selecting more examples in the batch. In a nutshell, when we want to query a batch of samples and this criterion is met, the expected error of our estimate is at most $\epsilon$ and hence we do not need to wait for the label of the selected examples before making the next selection.

3 Hybrid Batch Bayesian Optimization



In a sequential approach, we query for only one experiment at a time using a selection criterion (policy), mainly because the selection criterion requires the output of the previous query to find the next best one. Suppose we have the capability of running $n_b$ experiments in parallel, and we are limited by the total number of possible experiments $n_l$. At each iteration, the question is whether or not we can query more than one sample to speed up the experimental procedure without losing performance compared to the sequential approach.

We use Expected Improvement (EI) as our base sequential selection criterion. Below we provide the formal definition for EI.

Definition 2. EI at point $x$ with associated GP prediction $y | O \sim \mathcal{N}(\mu_{x|O}, \sigma^2_{x|O})$ is defined to be

$$EI(x|O) = \left(-u\,\Phi(-u) + \phi(u)\right) \sigma_{x|O}, \qquad (4)$$

where $u = (y_{\max} - \mu_{x|O}) / \sigma_{x|O}$ and $y_{\max} = \max_{y_i \in y_O} y_i$. Also, $\Phi(\cdot)$ and $\phi(\cdot)$ represent the standard Gaussian distribution and density functions respectively.
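Equation (4) translates almost directly into code. The following is a minimal sketch using only the standard library; the function name and the handling of a zero standard deviation are illustrative choices rather than part of the definition.

```python
import math

def expected_improvement(mu, sigma, y_max):
    """EI of equation (4): (-u * Phi(-u) + phi(u)) * sigma with u = (y_max - mu) / sigma."""
    if sigma <= 0.0:
        # Degenerate case (assumption): no uncertainty, so the improvement is deterministic.
        return max(mu - y_max, 0.0)
    u = (y_max - mu) / sigma
    Phi_neg_u = 0.5 * (1.0 + math.erf(-u / math.sqrt(2.0)))   # standard normal CDF at -u
    phi_u = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)  # standard normal density at u
    return (-u * Phi_neg_u + phi_u) * sigma

# When the predicted mean equals the incumbent (u = 0), EI reduces to sigma * phi(0).
ei_at_incumbent = expected_improvement(mu=1.0, sigma=2.0, y_max=1.0)
```

EI grows with both the predicted mean and the predictive standard deviation, which is what lets the policy trade off exploitation against exploration.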

Our proposed algorithm selects a batch (possibly one) of samples at each iteration based on the EI policy, where the batch size is dynamically determined at each step. In particular, the algorithm will continue to select more experiments if the condition in (3) is satisfied for the selected point $z$.

To explain the algorithm, suppose we are at the beginning of the first round of the algorithm. Thus far, we have observed $y_O = f(x_O)$ at some randomly chosen sample points $x_O$. To form our batch query, we start from an empty set of samples and gradually add the next best sample one at a time. The first sample we pick ($x_1$) is identical to the first sample that sequential EI picks ($x_1^*$), simply because both maximize the same objective, i.e., $x_1 = x_1^*$. To pick our second sample, we estimate $y_1^* = f(x_1^*)$ by some value $\hat{y}_1$. This estimation changes the EI function of all unobserved points to some $\widehat{EI}$ function formulated as

$$\widehat{EI}(z|O,x_1^*) = \left(-\hat{u}\,\Phi(-\hat{u}) + \phi(\hat{u})\right) \sigma_{z|O,x_1^*},$$

where $\hat{u} = \left(\max(y_{\max}, \hat{y}_1) - \hat{\mu}_{z|O,x_1^*}\right) / \sigma_{z|O,x_1^*}$. This is different from the true EI function:

$$EI(z|O,x_1^*) = \left(-u\,\Phi(-u) + \phi(u)\right) \sigma_{z|O,x_1^*},$$

where $u = \left(\max(y_{\max}, y_1^*) - \mu_{z|O,x_1^*}\right) / \sigma_{z|O,x_1^*}$. Obviously, optimizing $\widehat{EI}$ might not lead to the optimum of the true EI. However, the next lemma shows that these two functions are close to each other for a good estimation $\hat{y}_1$.

Lemma 1. At any point $z$, we have

$$\left|\widehat{EI}(z|O,x_1^*) - EI(z|O,x_1^*)\right| \le \frac{1}{2}\left(1 + \frac{\sigma_{z|O}}{\sigma_{x_1^*|O}}\right) \left|\hat{y}_1 - y_1^*\right|. \qquad (5)$$

In the light of this lemma, there is hope that $x_2 = \arg\max \widehat{EI}$ (a potential batch sample from our algorithm) is close to $x_2^* = \arg\max EI$ (the optimal sample picked by the sequential policy). The next theorem bounds the error of our algorithm in terms of the second selected point in comparison to the sequential EI.

Theorem 3. Let $\Sigma_{\min}$ be the minimum singular value of the Hessian matrix $\partial^2 EI / \partial x^2$ on the line intersecting $x_2$ and $x_2^*$. Then,

$$\|x_2 - x_2^*\|_2^2 \le \frac{2}{\Sigma_{\min}} \left(1 + \frac{\max(\sigma_{x_2|O}, \sigma_{x_2^*|O})}{\sigma_{x_1^*|O}}\right) \left|\hat{y}_1 - y_1^*\right|. \qquad (6)$$

Here $x_2$ is the second point selected by our simulation-based batch method without knowing the outcome of $x_1$, whereas $x_2^*$ is the second point selected by the sequential EI method after knowing the outcome of $x_1$.

Remark 1: The parameter $\Sigma_{\min}$ captures the curvature of the EI function around its optimal point $x_2^*$. This curvature cannot be zero unless $x_2$ is very far from $x_2^*$, which is very unlikely due to the closeness of their expected values (see Corollary 1).

Remark 2: This theorem shows that the sample estimation error is proportional to the square root of the estimation error of $y_1^*$. This means that the sample estimation is more sensitive to the output estimation error for functions taking value in [0, 1], since the square root of an error smaller than one is larger than the error itself.
This line of analysis can be extended to the next samples. These results show that an algorithm based on the estimation can be successful. In practice, after we optimize $\widehat{EI}$ for $x_2$, we check condition (3) (i.e., $\gamma_{x_2}\left(\theta_{x_1^*} + \|\hat{y}_1 - \mu_{x_1^*|O}\|_2\right) \le \epsilon$), and if this condition is satisfied, we add $x_2$ to our batch query and move on to $x_3$ and so on. Algorithm 1 summarizes our proposed method for hybrid batch Bayesian optimization.

Algorithm 1 Hybrid Batch Expected Improvement

Input: Total budget of experiments ($n_l$), maximum batch size ($n_b$), the predictor ($\hat{y}$), current observation $O = (x_O, y_O)$ and stopping threshold $\epsilon$.
while $n_l > 0$ do
    $x_1^* \leftarrow \arg\max EI(x|O)$.
    $A \leftarrow (x_1^*, \hat{y}_1)$, $n_l \leftarrow n_l - 1$.
    $z \leftarrow \arg\max EI(x|O \cup A)$.
    while ($\gamma_z(\theta_{x_A} + \|\hat{y}_A - \mu_{x_A|O}\|_2) \le \epsilon$) and ($n_l > 0$) and ($|A| < n_b$) do
        $A \leftarrow A \cup (z, \hat{y}_z)$, $n_l \leftarrow n_l - 1$.
        $z \leftarrow \arg\max EI(x|O \cup A)$.
    end while
    RunExperiment($x_A$) and add the observed outcomes to $O$.
end while
return $\max(y_O)$

In early stages, this algorithm behaves more like a sequential policy since the criterion for building up a batch is very hard to satisfy, mainly because $\theta_{x_A}$ is large when we have only a few samples in $O$. After collecting enough samples, the term $\theta_{x_A}$ starts decreasing, and as it gets closer and closer to zero, we can select larger and larger batch sizes. Thus, the algorithm gradually transitions into a batch policy while maintaining a close match to the performance of the pure sequential policy.
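To make the control flow of Algorithm 1 concrete, here is a minimal one-dimensional sketch. It is not the authors' implementation: the grid of candidates, the kernel width, the jitter term, the toy objective, and the choice $\theta = \sqrt{2/\pi}\,\|\sigma_{x_A|O}\|_1$ are all assumptions made for illustration, and $\hat{y}$ is set to the GP mean.

```python
import math
import numpy as np

def kern(X, Y, width):
    """Gaussian kernel exp(-(x - y)^2 / width) for 1-D inputs."""
    return np.exp(-((X[:, None] - Y[None, :]) ** 2) / width)

def gp(cand, X, y, width, jitter=1e-6):
    """Zero-mean GP posterior mean and standard deviation at the candidate points."""
    K_inv = np.linalg.inv(kern(X, X, width) + jitter * np.eye(len(X)))
    Kc = kern(cand, X, width)
    var = 1.0 - np.einsum("ij,jk,ik->i", Kc, K_inv, Kc)
    return Kc @ K_inv @ y, np.sqrt(np.clip(var, 1e-12, None))

def ei(mu, sd, y_max):
    """Expected Improvement of Definition 2, vectorised over candidates."""
    u = (y_max - mu) / sd
    Phi = 0.5 * (1.0 + np.vectorize(math.erf)(-u / math.sqrt(2.0)))
    return (-u * Phi + np.exp(-0.5 * u ** 2) / math.sqrt(2.0 * math.pi)) * sd

def criterion(z, X, y, xA, yA, width, jitter=1e-6):
    """Left-hand side of condition (3): gamma_z * (theta + ||y_hat - mu||_2)."""
    A_inv = np.linalg.inv(kern(X, X, width) + jitter * np.eye(len(X)))
    B, C = kern(xA, X, width), kern(z, X, width)
    S = kern(xA, xA, width) - B @ A_inv @ B.T + jitter * np.eye(len(xA))  # D^{-1}
    gamma = np.linalg.norm((kern(z, xA, width) - C @ A_inv @ B.T) @ np.linalg.inv(S))
    theta = math.sqrt(2.0 / math.pi) * np.sqrt(np.clip(np.diag(S), 0.0, None)).sum()
    bias = np.linalg.norm(yA - B @ A_inv @ y)
    return gamma * (theta + bias)

def hybrid_batch_ei(f, cand, X, y, n_l, n_b, eps, width=0.05):
    batch_sizes = []
    while n_l > 0:
        xA, yA = [], []                                  # pending batch and simulated outcomes
        while n_l > 0 and len(xA) < n_b:
            Xs, ys = np.append(X, xA), np.append(y, yA)
            mu, sd = gp(cand, Xs, ys, width)
            i = int(np.argmax(ei(mu, sd, ys.max())))
            if xA and criterion(cand[i:i + 1], X, y, np.array(xA), np.array(yA), width) > eps:
                break                                    # simulation bias too large: stop the batch
            xA.append(cand[i]); yA.append(mu[i])         # y_hat = GP mean
            n_l -= 1
        X, y = np.append(X, xA), np.append(y, f(np.array(xA)))  # run the whole batch
        batch_sizes.append(len(xA))
    return X, y, batch_sizes

f = lambda x: np.exp(-(x - 0.3) ** 2 / 0.02) + 0.5 * np.exp(-(x - 0.8) ** 2 / 0.01)  # toy objective
X0 = np.array([0.1, 0.9])
X, y, sizes = hybrid_batch_ei(f, np.linspace(0.0, 1.0, 201), X0, f(X0), n_l=15, n_b=5, eps=0.02)
print(sizes)  # batch size chosen at each iteration
```

The first point of every batch is always accepted, so each outer iteration consumes at least one experiment and the loop terminates once the budget is spent; the list of batch sizes shows how often the criterion allowed more than one experiment per iteration.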

4 Experimental Results 




Figure 1: The contour plots for Fuel Cell and Hydrogen.

Benchmarks. We consider 6 well-known synthetic benchmark functions: Cosines and Rosenbrock [1, 4] over $[0, 1]^2$, Hartman(3) [6] over $[0, 1]^3$, Hartman(6) [6] over $[0, 1]^6$, Shekel [5] over $[3, 6]^4$, and Michalewicz [10] over $[0, \pi]^5$. The analytic expressions for these functions are shown in Table 1.

The other two real benchmarks are Fuel Cell and Hydrogen. In Fuel Cell, the goal is to maximize the generated electricity from microbial fuel cells by changing the nano structure properties of the anodes. We fit a regression model on the data to build our function $f(\cdot)$ for evaluation. In the Hydrogen benchmark, the data has been collected as part of a study on Hydrogen production from a particular bacteria, where the goal is to maximize the amount of Hydrogen production by optimizing the pH and Nitrogen levels of the growth medium. Both Fuel Cell and Hydrogen data are in $[0, 1]^2$. Their contour plots are shown in Figure 1.

Table 1: Benchmark Functions

| Function | Expression |
|---|---|
| Cosines(2) | $1 - (u^2 + v^2 - 0.3\cos(3\pi u) - 0.3\cos(3\pi v))$, with $u = 1.6x - 0.5$, $v = 1.6y - 0.5$ |
| Rosenbrock(2) | $10 - 100(y - x^2)^2 - (1 - x)^2$ |
| Hartman(3,6) | $\sum_{i=1}^{4} \Omega_i \exp\left(-\sum_{j=1}^{d} A_{ij}(x_j - P_{ij})^2\right)$, where $\Omega_{1 \times 4}$, $A_{4 \times d}$, $P_{4 \times d}$ are constants |
| Michalewicz(5) | $-\sum_{i=1}^{5} \sin(x_i)\, \sin\left(\frac{i\, x_i^2}{\pi}\right)^{20}$ |
| Shekel(4) | $\sum_{i=1}^{10} \frac{1}{\omega_i + \sum_{j=1}^{4} (x_j - B_{ji})^2}$, where $\omega_{1 \times 10}$, $B_{4 \times 10}$ are constants |

Table 2: Benchmarks Performance (average simple regret; each Speedup row refers to the estimator in the row above it)

| Method | Cosines | Hydrogen | FC | Rosenbrock | Hartman 3 | Michalewicz | Shekel | Hartman 6 |
|---|---|---|---|---|---|---|---|---|
| Sequential | 0.223 | 0.048 | 0.211 | 0.013 | 0.042 | 0.431 | 0.389 | 0.263 |
| Random | 0.490 | 0.282 | 0.307 | 0.485 | 0.206 | 0.607 | 0.680 | 0.505 |
| $\hat{y} = M$ | 0.223 | 0.048 | 0.211 | 0.014 | 0.040 | 0.429 | 0.386 | 0.270 |
| Speedup | 2% | 4% | 3% | 3% | 2% | 2% | 10% | 2% |
| $\hat{y} = (1 + C)\,y_{\max}$ | 0.222 | 0.049 | 0.214 | 0.012 | 0.044 | 0.438 | 0.401 | 0.263 |
| Speedup | 22% | 14% | 5% | 10% | 6% | 7% | 19% | 7% |
| $\hat{y} = y_{\max}$ | 0.210 | 0.050 | 0.219 | 0.013 | 0.040 | 0.440 | 0.375 | 0.276 |
| Speedup | 23% | 15% | 5% | 10% | 11% | 12% | 25% | 13% |
| $\hat{y} = \mu$ | 0.222 | 0.050 | 0.214 | 0.011 | 0.052 | 0.450 | 0.412 | 0.271 |
| Speedup | 45% | 57% | 43% | 37% | 70% | 77% | 78% | 75% |
| $\hat{y} = y_{\min}$ | 0.212 | 0.050 | 0.213 | 0.011 | 0.067 | 0.444 | 0.430 | 0.283 |
| Speedup | 38% | 50% | 32% | 18% | 54% | 75% | 77% | 72% |
| $\hat{y}$ = random | 0.212 | 0.050 | 0.211 | 0.012 | 0.047 | 0.440 | 0.382 | 0.284 |
| Speedup | 39% | 38% | 20% | 20% | 47% | 58% | 60% | 58% |
| Matching | 0.295 | 0.085 | 0.246 | 0.012 | 0.078 | 0.430 | 0.521 | 0.320 |
| CL($\mu$) | 0.301 | 0.084 | 0.257 | 0.012 | 0.081 | 0.451 | 0.551 | 0.319 |

Setting. We use a GP with a zero-mean prior and Gaussian kernel function $k(x, y) = \exp\left(-\frac{1}{l}\|x - y\|^2\right)$ with kernel width $l = 0.01 \sum_{i=1}^{d} l_i$, where $l_i$ is the length of the $i$-th dimension [1]. For this kernel function, we can directly derive the next two corollaries from Theorems 1 and 2.

Corollary 2. For all points $z \in \mathcal{X} \setminus \{O, x_1^*\}$, and kernel function $k(x, y) = e^{-\|x - y\|^2 / l}$, we have $\Delta(\sigma_z) \ge \epsilon$ if

$$\|z - x_1^*\|^2 \le -l \ln\left(\sqrt{n}\, \|A^{-1}B^T\|_2 + \sigma_{x_1^*|O} \sqrt{\epsilon}\right).$$

This corollary entails that after selecting the first experiment $x_1^*$, the set of points $z$ such that $\Delta(\sigma_z) \ge \epsilon$ are located inside a hypersphere centered at $x_1^*$. In other words, the points inside the hypersphere are those whose variance is affected significantly (by more than $\epsilon$) when $x_1^*$ is selected.

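Corollary 3. Under the assumption of Corollary 2, we have $E\left[\left|\mu_{z|O,x^*} - \hat{\mu}_{z|O,x^*}\right|\right] \ge \epsilon$ if

$$\|z - x_1^*\|^2 \le -\frac{l}{2} \ln\left(\frac{\pi \epsilon^2 \sigma^2_{x_1^*|O}}{4} - n\, \|A^{-1}B^T\|_2^2\right).$$

Similar to Corollary 2, Corollary 3 represents a hypersphere centered at $x_1^*$, and the points which are inside the hypersphere are those whose expected values are affected by more than $\epsilon$ when $x_1^*$ is selected.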

We run our algorithm on each benchmark 100 independent times and the average simple regret is reported as the result. The simple regret is the difference between the maximum value of $f(\cdot)$, denoted by $M$, and $y_{\max}$ after finishing the experimental procedure. In each run, the algorithm starts with 2 initial random points for the 2- and 3-dimensional benchmarks and 5 initial random points for the higher dimensional benchmarks. The total number of experiments $n_l$ is set to 15 for the 2- and 3-dimensional and 30 for the higher dimensional benchmarks. The maximum batch size at each iteration, $n_b$, is set to 5. The parameter $\epsilon$ is set to 0.02 for the 2- and 3-dimensional and 0.2 for the higher dimensional benchmarks. Note that our experimental setup is designed to match typical scenarios encountered in real applications, where we typically start with a very small number of random experiments and are restricted by a total budget.

Results. Our algorithm requires us to select a specific estimation for $\hat{y}$. Recall that our theoretical analysis from Theorem 2 suggests that to minimize the estimation error of $\hat{\mu}_{z|O,x^*}$ in expectation, we should use $\hat{y} = \mu_{x^*|O}$. Here we hope to confirm this by comparing different possible estimations for $\hat{y}$. In particular, we consider 6 different estimations of $\hat{y}$: 1) $\hat{y} = M$, which means we expect to observe the best possible output for each experiment selected by EI; 2) $\hat{y} = y_{\max}$, where $y_{\max} = \max_{y_i \in y_O} y_i$ is our current best observation; 3) $\hat{y} = (1 + C)\,y_{\max}$, which means each step of the EI algorithm is expected to improve the best current observation by a margin $C$ (we set the value of $C$ to 0.1 in our experiments); 4) $\hat{y} = \mu_{x^*|O}$, which means we set the value of $\hat{y}$ to be the expected output at that point; 5) $\hat{y} = y_{\min}$, where $y_{\min} = \min_{y_i \in y_O} y_i$ is the current minimum observed output; and 6) $\hat{y}$ = random, which sets $\hat{y}$ to a uniform random value drawn in $[y_{\min}, y_{\max}]$.

To demonstrate the effectiveness of our algorithm, we consider two state-of-the-art batch BO algorithms in the literature: 1) simulation matching (Matching) [2] and 2) the constant liar approach in which the output of the selected samples in the batch is set to their mean in order to select the next experiment (CL($\mu$)) [7]. For both methods, we set the batch size to $k = 5$. We also report the performance of the sequential EI and pure random selection policies.

The speedup of our proposed approach is calculated as the percentage of the samples in the whole experiment that are selected in batch mode. More specifically, if we finish $n_l$ samples in $T$ steps, the speedup is calculated as $1 - T/n_l$. Clearly, the maximum speedup in our setting is 80%, which can only be achieved if we select 5 experiments at each time step. For example, the speedup of the baseline batch approaches, Matching and CL($\mu$), is 80%. Table 2 shows the results.
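As a small worked example of the speedup measure, the sketch below evaluates $1 - T/n_l$ for a fully batched run, a fully sequential run, and a mixed schedule. The function name and the mixed schedule are hypothetical and chosen only for illustration.

```python
def speedup(n_l, T):
    """Fraction of iterations saved when n_l experiments finish in T steps: 1 - T / n_l."""
    return 1.0 - T / n_l

full_batch = speedup(n_l=30, T=6)      # 5 experiments per step
sequential = speedup(n_l=15, T=15)     # one experiment per step

batch_sizes = [1, 1, 2, 3, 4, 4]       # hypothetical schedule that grows over time
mixed = speedup(n_l=sum(batch_sizes), T=len(batch_sizes))
print(full_batch, sequential, mixed)
```

A schedule that starts sequentially and grows toward the maximum batch size therefore lands between 0% and the 80% ceiling imposed by a maximum batch size of 5.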

Interestingly, all of the 6 considered estimators achieved similar performance (comparable to EI) in terms of their regrets. The key difference between the different estimators is the level of speedup they achieve. In particular, we observe that the most speedup is achieved by $\hat{y} = \mu_{x^*|O}$, for which we are able to produce over 70% speedup (very close to fully batch) for the three high dimensional functions Michalewicz, Shekel and Hartman 6.

Further inspection of the speedup rates reveals that setting $\hat{y}$ to a large value, for example $M$, $y_{\max}$, and $(1 + C)\,y_{\max}$, generally leads to less speedup than the other choices. This can be explained by noting that a large value of $\hat{y}$ will lead to a higher chance of violating the condition required for making the next experiment selection in Algorithm 1, which is stated in Equation 3. In particular, for a large $\hat{y}$, the next point selected by EI will most likely be very close to $x^*$, since the means of the points close to $x^*$ are high. This will lead to a large $\gamma_z$. Further, the quantity $\|\hat{y} - \mu_{x^*|O}\|_2$ is likely very large. Consequently, it is easy to violate this condition and thus stop the selection process early on. In contrast, if $\hat{y} = y_{\min}$, although $\|\hat{y} - \mu_{x^*|O}\|_2$ is large, we expect $\gamma_z$ to be small because the next point $z$ selected by EI will likely be far away from $x^*$, since the mean and variance of the points close to $x^*$ are very small. Considering the two terms jointly, we expect to achieve a higher speedup by setting $\hat{y} = y_{\min}$ compared to setting $\hat{y}$ to a large value, which is exactly what we observe in our experiments. Finally, by setting $\hat{y}$ to $\mu_{x^*|O}$, we have $\|\hat{y} - \mu_{x^*|O}\|_2 = 0$ and the stopping criterion only depends on $\gamma_z \theta_{x^*}$. Thus we expect to achieve the maximum speedup among the different choices we consider for $\hat{y}$.

Our experimental investigation shows that the size of the batch generally increases as the experiment goes forward. This is consistent with our theoretical results, in which the value of $\gamma_z\left(\theta_{x^*} + \|\hat{y} - \mu_{x^*|O}\|_2\right)$ decreases as the variances decrease. Note that sampling at any arbitrary point when the number of observations is small would change the variance of the input space significantly compared to the case where there are a lot of observation points. Therefore, the stopping criterion of Algorithm 1 is less likely to be met in the early stages of the experimental procedure, where there are only a few observation points.

Figure 3: The performance of different batch algorithms for batch size 10. (Panels: Fuel Cell, Hydrogen, Cosines, Rosenbrock, Hartman(3), Shekel, Michalewicz, Hartman(6); horizontal axis: # of experiments.)

The $\mu$-Constant Batch Approach. This part of the experiments is motivated by our theoretical analysis, and the goal is to shed some light on a batch method recently proposed by Ginsbourger et al. [7], which selects a batch of experiments that jointly maximize the EI objective. They show that finding such a batch of experiments is practically intractable. Therefore, they introduced a heuristic approach called Constant liar to select a batch of $k$ experiments. After selecting the first experiment, Constant liar sets the output of the selected experiment to a constant value $c$. That experiment is then added to the set of observations and the next experiment is selected. This procedure is repeated until $k$ experiments are selected. They introduced several possible ways for setting $c$, including $c = M$, $c = \mu$ and $c = y_{\min}$. They empirically demonstrated that setting $c = M$ provided them a good result for their particular test functions. However, there is no theoretical justification or guidance toward what is the best $c$.

Our theoretical analysis, in particular Corollary 1, indicates that by setting $c$ ($\hat{y}$ in this paper) to $\mu_{x^*|O}$, the condition for continued experiment selection, i.e., $\gamma_z \theta_{x^*} \le \epsilon$, can be met more easily than with other settings. Thus, a batch of $k > 1$ experiments is requested at most iterations without degrading the performance. This theoretical result also justifies the choice of setting $c = \mu_{x^*|O}$ in the constant liar approach. We call this approach $\mu$-Constant Batch. We run this algorithm on the proposed 8 benchmarks for batch sizes 5 and 10. Figures 2 and 3 show the performance of $\mu$-Constant along with 5 competitive approaches: 1) Sequential EI; 2) Constant liar with $\hat{y} = M$; 3) Constant liar with $\hat{y} = y_{\max}$; 4) Constant liar with $\hat{y} = y_{\min}$; and 5) Matching, which is a recently proposed approach by Azimi et al. [2]. For this set of experiments, we use the same experimental setup as used in Table 2.

The results show that the $\mu$-Constant batch approach performs very competitively compared to the Matching approach, which is one of the best existing batch Bayesian optimization approaches in the literature. In addition, it is more practical than the Matching approach for high dimensional applications since its computational complexity is significantly less than that of the Matching algorithm. Note that the performance of $\mu$-Constant is also shown in Table 2 as CL($\mu$). It is worth emphasizing that while $\mu$-Constant achieves highly competitive batch performance, it is consistently worse than sequential EI and the proposed Hybrid Batch EI algorithm. This result suggests that the stopping criterion used in Algorithm 1 is in fact effective toward identifying the condition under which we must stop increasing the batch size to avoid significant performance degradation compared to the sequential EI.

5 Conclusion 

In the Bayesian optimization framework, we investigated the problem of batch query selection with the goal of maintaining the performance of a sequential policy while using fewer iterations. Although our results are for general BO problems, for the sake of clarity, we focused on the task of maximizing an unknown non-convex/concave function.

There are two main contributions in this paper.

Firstly, we introduce a systematic way to analyze the performance and limits of simulation-based batch BO methods by a) proving universal bounds on the bias caused by the simulation (estimation-of-outcome) error; and b) analyzing the selection of the second experiment when we have an estimate of the outcome of the first experiment. In all cases, we provide theoretical bounds on the error, relating the simulation error to the prediction error of the next best experiment.

Secondly, based on the analysis above, we proposed an algorithm that behaves optimally in expectation. This algorithm at each step decides whether or not to pick another query to add to the current batch, and as such dynamically determines the appropriate batch size at each step. In early iterations, our algorithm behaves more like the sequential policy and gradually moves toward a batch policy with variable batch sizes.

The empirical evaluation over both synthetic and real data shows substantial speedup (up to 78%) compared to the corresponding sequential policy, with little to no loss in the optimization performance. Our theoretical results also shed some interesting light on the Constant-liar approach, a recently proposed batch selection method based on the EI objective.
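A Proof of Theorem 1

Recalling the notation introduced in the theorem statement, and using $\sigma^2_{z|O} = k(z,z) - CA^{-1}C^T$ together with the block matrix inversion lemma, we have

$$\Delta(\sigma_z) = \begin{bmatrix} C & k(z,x^*) \end{bmatrix} \begin{bmatrix} A & B^T \\ B & k(x^*,x^*) \end{bmatrix}^{-1} \begin{bmatrix} C^T \\ k(z,x^*)^T \end{bmatrix} - CA^{-1}C^T$$

$$= \begin{bmatrix} C & k(z,x^*) \end{bmatrix} \begin{bmatrix} A^{-1} + A^{-1}B^T D B A^{-1} & -A^{-1}B^T D \\ -DBA^{-1} & D \end{bmatrix} \begin{bmatrix} C^T \\ k(z,x^*)^T \end{bmatrix} - CA^{-1}C^T$$

$$= \left(CA^{-1}B^T - k(z,x^*)\right) D \left(BA^{-1}C^T - k(z,x^*)^T\right).$$

This concludes the proof of the theorem.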

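B Proof of Theorem 2

By definition and the block matrix inversion lemma, we have

$$\mu_{z|O,x^*} - \hat{\mu}_{z|O,x^*} = k(z, \{x_O, x^*\})\, k(\{x_O, x^*\}, \{x_O, x^*\})^{-1} \begin{bmatrix} 0 \\ y^* - \hat{y} \end{bmatrix} = \left(k(z,x^*) - CA^{-1}B^T\right) D\, (y^* - \hat{y}).$$

For the second part, we have

$$\mu_{z|O,x^*} - \mu_{z|O} = \begin{bmatrix} C & k(z,x^*) \end{bmatrix} \begin{bmatrix} A & B^T \\ B & k(x^*,x^*) \end{bmatrix}^{-1} \begin{bmatrix} y_O \\ y^* \end{bmatrix} - CA^{-1}y_O$$

$$= \begin{bmatrix} C & k(z,x^*) \end{bmatrix} \begin{bmatrix} A^{-1} + A^{-1}B^T D B A^{-1} & -A^{-1}B^T D \\ -DBA^{-1} & D \end{bmatrix} \begin{bmatrix} y_O \\ y^* \end{bmatrix} - CA^{-1}y_O$$

$$= \left(CA^{-1}B^T - k(z,x^*)\right) D \left(BA^{-1}y_O - y^*\right)$$

$$= \left(CA^{-1}B^T - k(z,x^*)\right) D \left(\mu_{x^*|O} - y^*\right).$$

Taking absolute values and applying the Cauchy-Schwarz inequality to both identities yields the two bounds. This concludes the proof of the theorem.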



C Proof of Lemma 1 

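Let $\Delta_z = \max(y_{\max}, y_1^*) - \mu_{z|O,x_1^*}$. Using Theorem 2, we have

$$\hat{\Delta}_z := \max(y_{\max}, \hat{y}_1) - \hat{\mu}_{z|O,x_1^*}$$

$$= \max(y_{\max}, y_1^*) - \mu_{z|O,x_1^*} + \max(y_{\max}, \hat{y}_1) - \max(y_{\max}, y_1^*) - \frac{1}{\sigma^2_{x_1^*|O}} \left(k(z, x_1^*) - k(z, x_O)\,k(x_O, x_O)^{-1} k(x_O, x_1^*)\right) (\hat{y}_1 - y_1^*)$$

$$= \Delta_z + \max(y_{\max}, \hat{y}_1) - \max(y_{\max}, y_1^*) - \rho_{z,x_1^*} \frac{\sigma_{z|O}}{\sigma_{x_1^*|O}} (\hat{y}_1 - y_1^*)$$

$$= \Delta_z + \delta_z.$$

Here, $\rho_{z,x_1^*}$ represents the correlation coefficient between $z$ and $x_1^*$. Thus, we have

$$|\delta_z| \le \left(1 + \frac{\sigma_{z|O}}{\sigma_{x_1^*|O}}\right) \left|\hat{y}_1 - y_1^*\right|.$$

Writing $\sigma = \sigma_{z|O,x_1^*}$, the two EI functions are

$$\widehat{EI}(z) = -\hat{\Delta}_z\, \Phi\left(-\frac{\hat{\Delta}_z}{\sigma}\right) + \sigma\, \phi\left(\frac{\hat{\Delta}_z}{\sigma}\right), \qquad EI(z) = -\Delta_z\, \Phi\left(-\frac{\Delta_z}{\sigma}\right) + \sigma\, \phi\left(\frac{\Delta_z}{\sigma}\right).$$

By the mean-value theorem, there exists $\alpha \in [0, 1]$ such that

$$\widehat{EI}(z) - EI(z) = -\Phi\left(-\frac{\Delta_z + \alpha \delta_z}{\sigma_{z|O,x_1^*}}\right) \delta_z.$$

Thus,

$$\left|\widehat{EI}(z) - EI(z)\right| \le \frac{1}{2} |\delta_z| \le \frac{1}{2} \left(1 + \frac{\sigma_{z|O}}{\sigma_{x_1^*|O}}\right) \left|\hat{y}_1 - y_1^*\right|.$$

This concludes the proof of the lemma.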



D Proof of Theorem 3 

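By optimality of $x_2$ (for $\widehat{EI}$) and $x_2^*$ (for $EI$), we have

$$EI(x_2) - \widehat{EI}(x_2) \le EI(x_2^*) - \widehat{EI}(x_2) \le EI(x_2^*) - \widehat{EI}(x_2^*).$$

Using Lemma 1, we get

$$\left|EI(x_2^*) - \widehat{EI}(x_2)\right| \le \frac{1}{2} \left(1 + \frac{\max(\sigma_{x_2|O}, \sigma_{x_2^*|O})}{\sigma_{x_1^*|O}}\right) \left|\hat{y}_1 - y_1^*\right|.$$

We can continue

$$EI(x_2^*) - EI(x_2) \le \left|EI(x_2^*) - \widehat{EI}(x_2)\right| + \left|\widehat{EI}(x_2) - EI(x_2)\right| \le \left(1 + \frac{\max(\sigma_{x_2|O}, \sigma_{x_2^*|O})}{\sigma_{x_1^*|O}}\right) \left|\hat{y}_1 - y_1^*\right|.$$

By optimality of $x_2^*$, the derivative of EI is zero at $x_2^*$, and a Taylor series expansion yields that for some $\alpha \in [0, 1]$, we have

$$EI(x_2^*) - EI(x_2) = -\frac{1}{2} (x_2^* - x_2)^T \frac{\partial^2 EI}{\partial x^2}\left((1 - \alpha) x_2^* + \alpha x_2\right) (x_2^* - x_2).$$

Finally, we get

$$\|x_2 - x_2^*\|_2^2 \le \frac{2 \left(EI(x_2^*) - EI(x_2)\right)}{\Sigma_{\min}\left(\frac{\partial^2 EI}{\partial x^2}\left((1 - \alpha) x_2^* + \alpha x_2\right)\right)} \le \frac{2}{\Sigma_{\min}} \left(1 + \frac{\max(\sigma_{x_2|O}, \sigma_{x_2^*|O})}{\sigma_{x_1^*|O}}\right) \left|\hat{y}_1 - y_1^*\right|. \qquad (7)$$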

E Proof of Corollary 2 



From Theorem 1, the difference of variance of any point $z$ in the input space after adding the point $x_1^*$ to our observation set is exactly $D\left(k(z, x_1^*) - BA^{-1}C^T\right)^2$ if we consider $x_1^*$ as a single point. Since $\sigma^2_{z|O} - \sigma^2_{z|O,x_1^*} \ge 0$, we have $D \ge 0$. In addition, when $|x^*| = 1$, it can be shown that $D^{-1} = \sigma^2_{x_1^*|O}$. Thus, we are interested in the points where $\sigma^2_{z|O} - \sigma^2_{z|O,x_1^*} \ge \epsilon > 0$. Therefore we have:

$$D\, k(x_1^*, z)^2 - \left(2D\, CA^{-1}B^T\right) k(x_1^*, z) + D \left(CA^{-1}B^T\right)^2 - \epsilon \ge 0. \qquad (8)$$

This is a quadratic function of $k(x_1^*, z)$ with 2 real roots as follows:

$$r_{1,2} = CA^{-1}B^T \pm \sqrt{\epsilon / D}. \qquad (9)$$

So we are interested in the region where $k(x_1^*, z) \ge r_1$ or $k(x_1^*, z) \le r_2$. For a large value of $\epsilon$ we have $r_2 < 0$, and since $k(x_1^*, z) \ge 0$, we are only interested in where $k(x_1^*, z) \ge r_1$. Therefore we have

$$1 \ge k(x_1^*, z) = e^{-\|x_1^* - z\|^2 / l} \ge CA^{-1}B^T + \sqrt{\epsilon / D} \ge 0. \qquad (10)$$

We are trying to introduce an upper bound for $r_1$ which is free from $C$. Clearly $CA^{-1}B^T \le |CA^{-1}B^T|$. Then we have,

$$|CA^{-1}B^T| = \|CA^{-1}B^T\|_2$$
$$\le \|C\|_2\, \|A^{-1}B^T\|_2 \quad \text{(Cauchy-Schwarz inequality)}$$
$$\le \sqrt{n}\, \|C\|_\infty\, \|A^{-1}B^T\|_2$$
$$\le \sqrt{n}\, \|A^{-1}B^T\|_2 \quad \text{since } 0 \le \|C\|_\infty \le 1.$$

Therefore we are certain about the points satisfying the following:

$$k(x_1^*, z) \ge \sqrt{n}\, \|A^{-1}B^T\|_2 + \sqrt{\epsilon / D},$$
$$-\frac{1}{l} \|z - x_1^*\|^2 \ge \ln\left(\sqrt{n}\, \|A^{-1}B^T\|_2 + \sqrt{\epsilon / D}\right),$$
$$\|z - x_1^*\|^2 \le -l \ln\left(\sqrt{n}\, \|A^{-1}B^T\|_2 + \sigma_{x_1^*|O} \sqrt{\epsilon}\right).$$

F Proof of Corollary 3 

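$$\left\|\left(CA^{-1}B^T - k(x_1^*, z)\right) D\right\|_2 \sqrt{\frac{2}{\pi}}\, \|\sigma_{x_1^*|O}\|_1 \ge \epsilon \qquad (11)$$

$$\left|CA^{-1}B^T - k(x_1^*, z)\right| \ge \frac{\epsilon}{\sqrt{2/\pi}\, |\sigma_{x_1^*|O}|\, D} \qquad (12)$$

$$\left|CA^{-1}B^T - k(x_1^*, z)\right|^2 \ge \frac{\pi \epsilon^2}{2\, \sigma^2_{x_1^*|O}\, D^2} \qquad (13)$$

Note that $|a - b|^2 \le 2(a^2 + b^2)$. Therefore, using $D^{-1} = \sigma^2_{x_1^*|O}$ and the bound $|CA^{-1}B^T| \le \sqrt{n}\, \|A^{-1}B^T\|_2$ from the proof of Corollary 2, $E\left[\left|\mu_{z|O,x^*} - \hat{\mu}_{z|O,x^*}\right|\right] \ge \epsilon$ if we have

$$k(x_1^*, z)^2 \ge \frac{\pi \epsilon^2 \sigma^2_{x_1^*|O}}{4} - n\, \|A^{-1}B^T\|_2^2,$$

$$\|z - x_1^*\|^2 \le -\frac{l}{2} \ln\left(\frac{\pi \epsilon^2 \sigma^2_{x_1^*|O}}{4} - n\, \|A^{-1}B^T\|_2^2\right). \qquad (14)$$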